Loading libraries
Setting data location, make sure the files below are available so the rest runs.
datadir <- "../data"
list.files(datadir)
[1] "detroit-311.csv" "detroit-blight-violations.csv"
[3] "detroit-crime.csv" "detroit-demolition-permits.tsv"
[5] "detroit-PARCELS.csv"
Loading all data
Data permits exploration
head(data_permits)
First converting data fields to the right types
#converting to factors
cols <- c("CASE_TYPE", "CASE_DESCRIPTION", "LEGAL_USE", "BLD_PERMIT_TYPE",
"PERMIT_DESCRIPTION", "BLD_PERMIT_DESC", "BLD_TYPE_USE", "RESIDENTIAL",
"DESCRIPTION", "BLD_TYPE_CONST_COD", "BLD_ZONING_DIST", "BLD_USE_GROUP",
"BLD_BASEMENT", "FEE_TYPE", "CSF_CREATED_BY","CONDITION_FOR_APPROVAL")
data_permits %<>% mutate_each_(funs(factor(.)),cols)
#converting $$ to numeric
cols <- c("PCF_AMT_PD", "PCF_AMT_DUE", "PCF_UPDATED","ESTIMATED_COST")
data_permits %<>% mutate_each_(funs(from_currency(.)),cols)
#converting to dates
cols <-c("PERMIT_APPLIED","PERMIT_ISSUED","PERMIT_EXPIRES")
data_permits %<>% mutate_each_(funs(parse_date_time(.,orders="mdy",tz="America/Detroit")),cols)
3 failed to parse. 3 failed to parse. 3 failed to parse.
summary(data_permits)
PERMIT_NO PERMIT_APPLIED PERMIT_ISSUED
Length:7133 Min. :2010-05-20 00:00:00 Min. :2010-05-20 00:00:00
Class :character 1st Qu.:2012-08-15 00:00:00 1st Qu.:2012-08-15 00:00:00
Mode :character Median :2013-03-12 00:00:00 Median :2013-03-12 00:00:00
Mean :2013-07-07 20:39:12 Mean :2013-07-07 22:37:45
3rd Qu.:2014-10-15 00:00:00 3rd Qu.:2014-10-17 00:00:00
Max. :2015-08-28 00:00:00 Max. :2015-08-28 00:00:00
NA's :3 NA's :3
PERMIT_EXPIRES SITE_ADDRESS BETWEEN1 PARCEL_NO
Min. :2010-11-20 00:00:00 Length:7133 Length:7133 Length:7133
1st Qu.:2013-02-23 00:00:00 Class :character Class :character Class :character
Median :2013-09-08 00:00:00 Mode :character Mode :character Mode :character
Mean :2013-12-03 12:13:46
3rd Qu.:2015-02-28 00:00:00
Max. :2016-12-05 00:00:00
NA's :183
LOT_NUMBER SUBDIVISION CASE_TYPE CASE_DESCRIPTION
Length:7133 Length:7133 BLD:7133 Building Permit:7133
Class :character Class :character
Mode :character Mode :character
LEGAL_USE ESTIMATED_COST PARCEL_SIZE PARCEL_CLUSTER_SECTOR
ONE FAMILY DWELLING :2709 Min. : 0 Min. : 0 Min. : 1.000
RES : 920 1st Qu.: 4525 1st Qu.: 3354 1st Qu.: 2.000
ONE FAMILY : 798 Median : 13470 Median : 4095 Median : 5.000
TWO FAMILY DWELLING : 479 Mean : 18955 Mean : 13540 Mean : 4.656
SINGLE FAMILY DWELLING: 358 3rd Qu.: 18750 3rd Qu.: 4932 3rd Qu.: 7.000
(Other) :1425 Max. :448200 Max. :2554576 Max. :10.000
NA's : 444 NA's :6791 NA's :45 NA's :11
STORIES PARCEL_FLOOR_AREA PARCEL_GROUND_AREA PRC_AKA_ADDRESS BLD_PERMIT_TYPE
Min. : 1.000 Min. : 0 Min. : 0.0 Length:7133 DISM :5859
1st Qu.: 1.000 1st Qu.: 0 1st Qu.: 630.0 Class :character Dismantle:1274
Median : 1.500 Median : 0 Median : 768.0 Mode :character
Mean : 2.126 Mean : 6830 Mean : 742.3
3rd Qu.: 2.000 3rd Qu.: 0 3rd Qu.: 912.0
Max. :26.000 Max. :5102782 Max. :27316.0
NA's :1426 NA's :45 NA's :45
PERMIT_DESCRIPTION BLD_PERMIT_DESC BLD_TYPE_USE
Dismantle:5859 WRECK AND REMOVE DEBRIS:3077 28 :4265
NA's :1274 WRECK & REMOVE DEBRIS :2060 One Family Dwelling:1121
WRECK & REMOVE DEBRIS. : 469 54 : 944
Wrecking Permit : 292 6 : 260
WRECK AND REMOVE : 31 26 : 92
(Other) : 213 (Other) : 399
NA's : 991 NA's : 52
RESIDENTIAL DESCRIPTION BLD_TYPE_CONST_COD BLD_ZONING_DIST
NON-RESIDENTIAL: 659 One Family Dwelling:4265 5B :6456 R1 :1350
RESIDENTIAL :6474 Two Family Dwelling: 944 3B : 323 R2 : 790
Commercial Building: 260 2B : 61 B4 : 96
Multiple Dwelling : 92 1B : 35 R3 : 44
School : 42 5A : 23 M4 : 29
(Other) : 256 (Other): 34 (Other): 104
NA's :1274 NA's : 201 NA's :4720
BLD_USE_GROUP BLD_BASEMENT FEE_TYPE CSM_CASENO CSF_CREATED_BY SEQ_NO
R3 :6216 N :1051 WPMT :7037 Length:7133 L-FW :1636 Min. :1
M : 98 Y :5226 BCP : 27 Class :character RSA :1435 1st Qu.:1
R2 : 72 NA's: 856 BPM5 : 12 Mode :character GRE :1105 Median :1
B : 60 REIN : 9 H-AP : 641 Mean :1
E : 53 SPEC : 9 M-JD : 488 3rd Qu.:1
(Other): 189 WPM2 : 9 M-LR : 444 Max. :1
NA's : 445 (Other): 30 (Other):1384
PCF_AMT_PD PCF_AMT_DUE PCF_UPDATED OWNER_LAST_NAME
Min. : 0.0 Min. : 0.0 Min. : 1313 Length:7133
1st Qu.: 108.0 1st Qu.: 238.0 1st Qu.: 8615 Class :character
Median : 238.0 Median : 238.0 Median : 41211 Mode :character
Mean : 257.3 Mean : 274.1 Mean : 90674
3rd Qu.: 238.0 3rd Qu.: 238.0 3rd Qu.: 72085
Max. :20494.2 Max. :20494.2 Max. :101112414
NA's :1274 NA's :471
OWNER_FIRST_NAME OWNER_ADDRESS1 OWNER_ADDRESS2 OWNER_CITY
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
OWNER_STATE OWNER_ZIP CONTRACTOR_LAST_NAME CONTRACTOR_FIRST_NAME
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
CONTRACTOR_ADDRESS1 CONTRACTOR_ADDRESS2 CONTRACTOR_CITY CONTRACTOR_STATE
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
CONTRACTOR_ZIP
Min. :48009
1st Qu.:48204
Median :48209
Mean :48230
3rd Qu.:48227
Max. :92606
NA's :179
CONDITION_FOR_APPROVAL
Call for inspection upon compliance.Ord. 290-H, 12-11-14.3, ET SEQ.CC: OWNER: Borman LLC. 30600 Telegraph Rd. Bingham Farms, MI 48025CC: TENANT: LD Acquistion Co. P.O. Box 3429 El Segunda, CA 90245: 1
Remove all foundations from site and backfill.Ord. 290-H, 12-11-19.9, ET SEQ. : 1
NA's :7131
site_location owner_location contractor_location geom
Length:7133 Length:7133 Length:7133 Length:7133
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
Extracting building lat longs
data_permits %<>%
#filter out permits that have no lat/long
filter(grepl("\\([0-9\\.\\-]+, *[0-9\\.\\-]+\\)",site_location)) %>%
#extracting lat longs
mutate(lat = as.double(sub(".*\\(([0-9\\.\\-]+),.*","\\1", site_location))) %>%
mutate(long = as.double(sub(".*, *([0-9\\.\\-]+).*","\\1", site_location))) %>%
mutate(address_only = sub("([^\\(]+)\\([0-9\\.\\-]+,.*","\\1", site_location))
Create a list of buildings doing some magic to remove those entries whose lat/long stdev is larger than 10e-4 (~11m). Not much gets removed actually, but it’s consistent with steps below. Removed those records whose address was a lat/long (only ~40).
bld_list_permit <- data_permits %>%
mutate(r = sqrt(PARCEL_SIZE/pi) ) %>%
select(address=address_only, PARCEL_NO, LOT_NUMBER, PERMIT_ISSUED, PARCEL_SIZE, lat, long, r) %>%
filter(! grepl("\\([0-9\\.\\-]+, *[0-9\\.\\-]+\\)",address)) %>%
arrange(address, desc(PERMIT_ISSUED)) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 && sdlong<10e-4) || (is.na(sdlat) && is.na(sdlong))) %>%
filter(long > -83.3 && long < -82.8) %>%
filter(lat > 42.2 && lat < 42.5) %>%
arrange(PERMIT_ISSUED) %>%
summarise(n_permits=n(), last_permit=last(PERMIT_ISSUED),
lat=median(lat), long=median(long), r=last(r))
head(bld_list_permit)
Function to return number of buildings around a particular lat long within a predefined range (by default 500m or 1640ft) - input data, for simplicity, should have 4 fields - PARCEL_NO: assumed it’s a unique identifier for building) - lat: building lat coordinate - long: building long coordinate - r: estimated building’s area radious (when area is approximated to circle) - the other input variables are: - lt: target lat coordinate - ln: target long coordinate - rangeft (optional, range in feet to look buildings around target lat/long) - m_in_ft (option)
bld_in_range <- function(lt, ln, data, rangeft = 1640, m_in_ft = 0.3048) {
data %>%
rowwise() %>%
mutate(dist_ft = distGeo(c(ln,lt),c(long,lat))/m_in_ft) %>%
filter(dist_ft <= rangeft + r) %>%
nrow()
}
indata <- bld_list_permit %>%
filter(!is.na(r)) %>%
unique() %>%
select(address, lat, long, r)
Loading blight violation incidents
head(data_violations)
Data Violation exploration, converting first to the right data fields
#converting to factors
cols <- c("AgencyName","ViolationCode","Disposition","PaymentStatus","Void",
"ViolationCategory","Country")
data_violations %<>% mutate_each_(funs(factor(.)),cols)
#converting $$ to numeric
cols <- c("FineAmt","AdminFee","LateFee","StateFee","CleanUpCost","JudgmentAmt")
data_violations %<>% mutate_each_(funs(from_currency(.)),cols)
#not converting to dates as the dates in the fields below have weird years
cols <-c("TicketIssuedDT","HearingDT")
#data_violations %<>% mutate_each_(funs(from_currency(.)),cols)
summary(data_violations)
TicketID TicketNumber AgencyName
Min. : 18645 Length:307804 Building and Safety Engineering Department:173311
1st Qu.:101806 Class :character Department of Public Works :112757
Median :183824 Mode :character Detroit Police Department : 12853
Mean :182967 Health Department : 8881
3rd Qu.:265211 Neighborhood City Halls : 2
Max. :339184
ViolName ViolationStreetNumber ViolationStreetName MailingStreetNumber
Length:307804 Min. : -11064 Length:307804 Min. : 0
Class :character 1st Qu.: 4936 Class :character 1st Qu.: 3107
Mode :character Median : 10624 Mode :character Median : 9359
Mean : 12001 Mean : 18527
3rd Qu.: 15895 3rd Qu.: 18251
Max. :222222222 Max. :222222222
NA's :808
MailingStreetName MailingCity MailingState MailingZipCode
Length:307804 Length:307804 Length:307804 Length:307804
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
NonUsAddressCode Country TicketIssuedDT TicketIssuedTime
Length:307804 US : 18062 Length:307804 Length:307804
Class :character United Kingdom: 97 Class :character Class1:hms
Mode :character Singapore : 96 Mode :character Class2:difftime
Canada : 71 Mode :numeric
Australia : 22
(Other) : 136
NA's :289320
HearingDT CourtTime ViolationCode ViolDescription
Length:307804 Length:307804 9-1-36(a) :106521 Length:307804
Class :character Class1:hms 9-1-81(a) : 45036 Class :character
Mode :character Class2:difftime 9-1-104 : 38612 Mode :character
Mode :numeric 22-2-88 : 28730
22-2-88(b): 21804
9-1-110(a): 6779
(Other) : 60322
Disposition FineAmt AdminFee LateFee
Responsible By Default :172128 Min. : 0.0 Min. :20 Min. : 0.0
Not responsible By Dismissal : 53543 1st Qu.: 125.0 1st Qu.:20 1st Qu.: 10.0
Not responsible By City Dismissal: 44380 Median : 250.0 Median :20 Median : 25.0
Responsible By Admission : 16432 Mean : 360.2 Mean :20 Mean : 35.8
Responsible By Determination : 10233 3rd Qu.: 250.0 3rd Qu.:20 3rd Qu.: 25.0
Not responsible By Determination : 7809 Max. :10000.0 Max. :20 Max. :1000.0
(Other) : 3279 NA's :1973
StateFee CleanUpCost JudgmentAmt PaymentStatus
Min. :10 Min. : 0.000 Min. : 0.0 NO PAYMENT APPLIED :244303
1st Qu.:10 1st Qu.: 0.000 1st Qu.: 140.0 NO PAYMENT ON RECORD: 14565
Median :10 Median : 0.000 Median : 305.0 PAID IN FULL : 44319
Mean :10 Mean : 0.515 Mean : 425.2 PARTIAL PAYMENT MADE: 4617
3rd Qu.:10 3rd Qu.: 0.000 3rd Qu.: 305.0
Max. :10 Max. :13123.800 Max. :11030.0
NA's :1972
Void ViolationCategory ViolationAddress MailingAddress
0 : 99133 0:305787 Length:307804 Length:307804
NA's:208671 1: 2017 Class :character Class :character
Mode :character Mode :character
Getting the violation codes, we’ll manually categorize them below
violCodes <- data_violations %>%
select(ViolationCode, ViolDescription) %>%
unique()
Generated lat/longs and address, and cleaned data by keeping records with Detroit addresses only. Didn’t filter by country as it removed most of the entries (from 300k to 13k records). Didn’t remove disposition “not reposonsible” or “pending” as this could contain information
viol_list <- data_violations %>%
# filter(Country == "US") %>%
filter(grepl("\\([0-9\\.\\-]+, *[0-9\\.\\-]+\\)",ViolationAddress)) %>%
#extracting lat longs
mutate(lat = as.double(sub(".*\\(([0-9\\.\\-]+),.*","\\1", ViolationAddress))) %>%
mutate(long = as.double(sub(".*, *([0-9\\.\\-]+).*","\\1", ViolationAddress))) %>%
mutate(address_only = sub("([^\\(]+)\\([0-9\\.\\-]+,.*","\\1", ViolationAddress)) %>%
filter(grepl("Detroit",ViolationAddress)) %>%
# filter(! grepl("Not responsible",Disposition)) %>%
# filter(! grepl("PENDING", Disposition)) %>%
select(lat, long, ViolationCode, Disposition, JudgmentAmt, PaymentStatus, ViolationCategory, address_only)
head(viol_list)
What happens with Disposition? Well, it seems hat it could be categorized as responsible, not responsible, and pending
viol_list %>%
select(Disposition) %>%
group_by(Disposition) %>%
summarise(n())
There are about 80k unique lat/longs, some of them generate a huge amount of violations, here are the top 30 lat/longs
top30viols <- viol_list %>%
select(lat, long) %>%
mutate(geocord = paste(lat,long)) %>%
group_by(geocord) %>%
summarize(lat=last(lat), long=last(long), num_viols_in_geo = n()) %>%
arrange(desc(num_viols_in_geo)) %>%
head(30)
top30viols
Plotting them, we see that the top one has 21k violations and it’s in the center of Detroit, probably a standard lat/long coordinate when not the actual is not available. I’m not sure about the others with over 1000 violations…
p <- gmap(lat = 42.37, lng = -83.10, zoom = 11, width = 600, height = 350,
map_style = gmap_style("apple_mapsesque")) %>%
ly_points(long, lat, data = top30viols, hover = num_viols_in_geo,
col = 'red', alpha = pmin(num_viols_in_geo / 1000, 1)) %>%
x_axis(visible = FALSE) %>%
y_axis(visible = FALSE)
p
Are there the same number of unique addresses?
viol_list %>%
select(address_only) %>%
group_by(address_only) %>%
summarize(num_viols_in_address = n()) %>%
arrange(desc(num_viols_in_address))
Well, it seems that there are about 110k unique addresses, and 73k unique lat/longs. Let’s see how they contrast in terms of number of violations (lat/long vs. address)
viol_list_cleaned <- viol_list %>%
mutate(geocoord = paste(lat,long)) %>%
group_by(geocoord) %>%
mutate(num_viols_in_geocoord = n())
viol_list_cleaned %<>%
group_by(address_only) %>%
mutate(num_viols_in_address = n())
head(viol_list_cleaned)
It seems that about 70k entries have different number of violations when looking by address or lat/long, being 35k unique records duplicated, so getting rid of them
viol_list_cleaned %<>%
filter(! num_viols_in_address != num_viols_in_geocoord) %>%
group_by(ViolationCode, address_only) %>%
mutate(num_viols_by_vcode = n()) %>%
arrange(desc(num_viols_in_geocoord)) %>%
ungroup() %>%
unique()
head(viol_list_cleaned)
After cleaning, here are the top violations counts per geocoord/address
viol_list_cleaned %>%
select(address_only, lat, long) %>%
mutate(geocord = paste(lat,long)) %>%
group_by(geocord) %>%
summarize(address=last(address_only), lat=last(lat), long=last(long), num_viols_in_geo = n()) %>%
arrange(desc(num_viols_in_geo)) %>%
select(-geocord) %>%
head(30)
Looking at violation codes, it seems there is some common ones, taking the top 20
violCodes <- viol_list_cleaned %>%
# mutate(ViolationCode = sub("^([0-9]+-[0-9]+)-.*$","\\1",ViolationCode)) %>%
group_by(ViolationCode) %>%
tally(sort=TRUE)
violCodes
Manually categorized
violCodes_manual_categorization <-
read_csv("violCodes_manual_categorization.csv")
Parsed with column specification:
cols(
ViolGroup = col_character(),
ViolationCode = col_character(),
ViolDescription = col_character()
)
violCodes_manual_categorization %<>%
mutate(ViolGroup=as.factor(ViolGroup),
ViolationCode=as.factor(ViolationCode))
violCodes_manual_categorization
Adding a grouping factor for violation categories, ViolGroup
viol_list_cleaned %<>%
left_join(violCodes_manual_categorization,by="ViolationCode")
joining factors with different levels, coercing to character vector
#Some codes had no description, to prevent to NAs, grouping them to other
viol_list_cleaned[which(is.na(viol_list_cleaned$ViolGroup)),]$ViolGroup <- "other"
#viol_list_cleaned %<>%
# mutate(ViolGroup = ifelse(is.na(ViolGroup),"other",ViolGroup))
head(viol_list_cleaned)
Expand ViolGroup counts as separate features
violcodes_counts <- viol_list_cleaned %>%
select(address=address_only, lat, long, ViolGroup, Disposition,
JudgmentAmt, PaymentStatus) %>%
group_by(address, ViolGroup) %>%
summarize(num_viol_by_code = n()) %>%
ungroup() %>%
spread(ViolGroup, num_viol_by_code, fill = 0)
head(violcodes_counts)
Getting a list of buildings, the grouping wouldn’t be necessary as there is a one to one lat/long to address correspondance, but leaving it just in case.
bld_list_viol <- viol_list_cleaned %>%
select(address=address_only, lat, long, ViolationCode, Disposition,
JudgmentAmt, PaymentStatus) %>%
group_by(address, ViolationCode) %>%
group_by(address, Disposition) %>%
mutate(num_disposition = n()) %>%
group_by(address) %>%
mutate(num_viols = n(), max_amt = max(JudgmentAmt)) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter(grepl("^Responsible", Disposition)) %>%
filter((sdlat<10e-4 && sdlong<10e-4) || (is.na(sdlat) && is.na(sdlong))) %>%
filter(long > -83.3 && long < -82.8) %>%
filter(lat > 42.2 && lat < 42.5) %>%
summarise(lat=median(lat), long=median(long), num_viols = last(num_viols),
num_responsible = last(num_disposition), max_amt =last(max_amt)) %>%
unique()
head(bld_list_viol)
Adding the violation code counts
bld_list_viol %<>%
left_join(violcodes_counts, by="address")
colnames(bld_list_viol) <- make.names(colnames(bld_list_viol))
head(bld_list_viol)
Exploring 311 data
#converting to factors
cols <- c("issue_type", "ticket_status")
data_311 %<>% mutate_each_(funs(factor(.)),cols)
#converting to dates
cols <-c("ticket_closed_date_time", "acknowledged_at", "ticket_created_date_time",
"ticket_last_updated_date_time")
data_311 %<>% mutate_each_(funs(parse_date_time(.,orders="mdY HMS Op",tz="America/Detroit")),cols)
#dplyr::glimpse(data_311)
summary(data_311)
ticket_id city issue_type
Min. :1184398 Length:19680 Illegal Dumping / Illegal Dump Sites:3584
1st Qu.:1591936 Class :character Tree Issue :3546
Median :1705228 Mode :character Running Water in a Home or Building :2655
Mean :1699224 Clogged Drain :2490
3rd Qu.:1838305 Potholes :2399
Max. :1975499 Traffic Sign Issue :1030
(Other) :3976
ticket_status issue_description rating ticket_closed_date_time
Acknowledged:3012 Length:19680 Min. : 1.000 Min. :2014-07-14 16:24:49
Archived :9600 Class :character 1st Qu.: 2.000 1st Qu.:2015-04-14 13:14:10
Closed :7009 Mode :character Median : 3.000 Median :2015-06-15 13:52:22
Open : 59 Mean : 2.693 Mean :2015-06-01 22:16:27
3rd Qu.: 3.000 3rd Qu.:2015-08-13 15:31:53
Max. :19.000 Max. :2015-10-15 02:54:02
NA's :3175
acknowledged_at ticket_created_date_time ticket_last_updated_date_time
Min. :2014-07-14 14:05:07 Min. :2014-07-14 12:54:47 Min. :2014-07-14 17:06:47
1st Qu.:2015-04-16 12:46:43 1st Qu.:2015-04-14 15:37:22 1st Qu.:2015-04-21 09:06:55
Median :2015-06-15 12:58:42 Median :2015-06-10 13:55:45 Median :2015-06-22 11:44:41
Mean :2015-06-05 01:45:13 Mean :2015-06-01 17:55:42 Mean :2015-06-10 10:54:26
3rd Qu.:2015-08-11 13:51:43 3rd Qu.:2015-08-10 12:07:14 3rd Qu.:2015-08-21 09:12:26
Max. :2015-10-14 20:43:23 Max. :2015-10-15 00:32:16 Max. :2015-10-15 02:54:02
NA's :2023
address lat lng location image
Length:19680 Min. :41.88 Min. :-86.55 Length:19680 Length:19680
Class :character 1st Qu.:42.36 1st Qu.:-83.19 Class :character Class :character
Mode :character Median :42.39 Median :-83.11 Mode :character Mode :character
Mean :42.39 Mean :-83.11
3rd Qu.:42.42 3rd Qu.:-83.04
Max. :42.45 Max. :-82.91
There are 23 types of 311 issues, we could extract counts for each
types311 <- data_311 %>%
group_by(issue_type) %>%
summarise(num_311type = n()) %>%
arrange(desc(num_311type))
types311
Generating features for each issue type as a count per address
type311counts <- data_311 %>%
select(address, issue_type) %>%
group_by(address, issue_type) %>%
summarize(num_type311 = n()) %>%
ungroup() %>%
spread(issue_type, num_type311, fill = 0)
head(type311counts)
Most tickets are either archived or closed, probably there is no discrimination around that
data_311 %>%
group_by(ticket_status) %>%
summarise(num_tx_status = n()) %>%
arrange(desc(num_tx_status))
Rating could be interesting to use
data_311 %>%
group_by(rating) %>%
summarise(num_rating = n()) %>%
arrange(desc(num_rating))
All entries are for city of Detroit, which is good, but won’t be discriminating
data_311 %>%
group_by(city) %>%
summarise(num_city = n()) %>%
arrange(desc(num_city))
Grouping by address, it seems there are multiple lat/longs per address, removing those entries that have a lat/long STDEV larger than 10e-4 (~11m)
bld_list_311 <- data_311 %>%
select(address, lat, long=lng, rating, type311 = issue_type, rating) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 && sdlong<10e-4) || (is.na(sdlat) && is.na(sdlong))) %>%
filter(long > -83.3 && long < -82.8) %>%
filter(lat > 42.2 && lat < 42.5) %>%
summarize(lat=median(lat), long=median(long), num_311 = n(), max_rating = max(rating), min_rating=min(rating), diff_rating = max(rating) - min(rating)) %>%
unique()
head(bld_list_311)
Most 311 entries don’t change the rating
bld_list_311 %>%
group_by(diff_rating) %>%
tally(sort=TRUE)
Adding the 311 issue type counts
bld_list_311 %<>%
left_join(type311counts, by="address")
colnames(bld_list_311) <- make.names(colnames(bld_list_311))
head(bld_list_311)
Exploring criminal incidents in Detroit
head(data_crime)
#converting to factors
cols <- c("CATEGORY","STATEOFFENSEFILECLASS","PRECINCT","COUNCIL","NEIGHBORHOOD")
data_crime %<>% mutate_each_(funs(factor(.)),cols)
#converting to dates
cols <-c("INCIDENTDATE")
data_crime %<>% mutate_each_(funs(parse_date_time(.,orders="mdY HMS Op",tz="America/Detroit")),cols)
#dplyr::glimpse(data_crime)
summary(data_crime)
ROWNUM CASEID INCINO
Min. : 1 Min. :1612272 Min. :0.000e+00
1st Qu.: 29984 1st Qu.:1930308 1st Qu.:1.504e+09
Median : 59966 Median :1960745 Median :1.506e+09
Mean : 59966 Mean :1960655 Mean :1.506e+09
3rd Qu.: 89948 3rd Qu.:1991078 3rd Qu.:1.509e+09
Max. :119931 Max. :2021621 Max. :1.509e+10
NA's :2
CATEGORY OFFENSEDESCRIPTION STATEOFFENSEFILECLASS
TRAFFIC VIOLATIONS-MOTORCYCLE VIOLATIONS:29412 Length:119931 99009 :17383
ASSAULT :16471 Class :character 13001 :11140
LARCENY :14138 Mode :character 29000 : 9418
DAMAGE TO PROPERTY : 9418 13002 : 8283
AGGRAVATED ASSAULT : 8283 24001 : 6961
BURGLARY : 7764 (Other):66738
(Other) :34445 NA's : 8
INCIDENTDATE HOUR SCA PRECINCT
Min. :2015-01-01 00:00:00 Min. : 0.00 Min. : 201.0 8 :14076
1st Qu.:2015-03-31 00:00:00 1st Qu.: 8.00 1st Qu.: 409.0 3 :13913
Median :2015-06-16 00:00:00 Median :14.00 Median : 711.0 9 :13318
Mean :2015-06-14 10:37:57 Mean :12.91 Mean : 765.1 12 :12874
3rd Qu.:2015-08-30 00:00:00 3rd Qu.:18.00 3rd Qu.:1002.0 6 :11940
Max. :2015-12-10 00:00:00 Max. :23.00 Max. :9999.0 (Other):53039
NA's :771 NA's : 771
COUNCIL NEIGHBORHOOD CENSUSTRACT
City Council District 6:18330 GREENFIELD : 3944 Min. : 9
City Council District 7:18212 STATE FAIR-NOLAN: 3942 1st Qu.: 5114
City Council District 2:16619 WARRENDALE : 3582 Median : 5263
City Council District 3:16287 DENBY : 2955 Mean : 37989
City Council District 5:16230 PEMBROKE : 2888 3rd Qu.: 5393
(Other) :30131 (Other) :97961 Max. :99992016
NA's : 4122 NA's : 4659 NA's :13064
ADDRESS LON LAT LOCATION
Length:119931 Min. : -121.0 Min. : 0.0 Length:119931
Class :character 1st Qu.: -83.2 1st Qu.: 42.4 Class :character
Mode :character Median : -83.1 Median : 42.4 Mode :character
Mean : 2236.2 Mean : 2361.4
3rd Qu.: -83.0 3rd Qu.: 42.4
Max. :999999.0 Max. :999999.0
NA's :59 NA's :59
There are 50 categories of crimes, could be added as counts to the features
crime_categories <- data_crime %>%
group_by(CATEGORY) %>%
tally(sort=TRUE)
crime_categories
Seems that crime is spread evenly, except for districts 3 and 5
data_crime %>%
group_by(COUNCIL) %>%
tally(sort=TRUE)
Grouping by address, it seems there are multiple lat/longs per address, and about 1000 entries have a lat/long STDEV larger than 10e-4 (~11m distance), so I removed those
bld_list_crime <- data_crime %>%
select(address=ADDRESS, lat=LAT, long=LON, typecrime = CATEGORY) %>%
group_by(address) %>%
mutate(sdlat=sd(lat), sdlong=sd(long)) %>%
filter((sdlat<10e-4 && sdlong<10e-4) || (is.na(sdlat) && is.na(sdlong))) %>%
filter(long > -83.3 && long < -82.8) %>%
filter(lat > 42.2 && lat < 42.5) %>%
summarize(lat=median(lat), long=median(long), num_crime = n()) %>%
unique()
head(bld_list_crime)
Generating crime features for each issue type as a count per address
crime_category_counts <- data_crime %>%
select(address=ADDRESS, CATEGORY) %>%
group_by(address, CATEGORY) %>%
summarize(num_category = n()) %>%
ungroup() %>%
spread(CATEGORY, num_category, fill = 0)
head(crime_category_counts)
Adding the crime category counts
bld_list_crime %<>%
left_join(crime_category_counts, by="address")
colnames(bld_list_crime) <- make.names(colnames(bld_list_crime))
head(bld_list_crime)
Prepare a list of coordinates by type of record, at precision of 4 digits (~11m accuracy) REF: https://en.wikipedia.org/wiki/Decimal_degrees#Precision
bld_list_coord2 <- bld_list_311 %>%
mutate(type = "311") %>%
select(type, lat, lon=long, address) %>%
bind_rows(select(mutate(bld_list_crime, type="crime"), type, lat, lon=long, address)) %>%
bind_rows(select(mutate(bld_list_viol, type="viol"), type, lat, lon=long, address)) %>%
mutate(coord=paste(lat,lon,sep=",")) %>%
arrange(desc(lat), desc(lon))
bld_list_coord2
Let’s plot all records of 311 calls, blight violations, and crime reports in a map centered in Detroit. The records are grouped by distance. When zooming in, we see that lat/long coordinates that are about +/-0.0002 degrees appart seem to belong to the same location
library(leaflet)
leaflet() %>%
setView(lat = 42.37, lng = -83.10, zoom = 11) %>%
addTiles(group = "OSM (default)") %>%
addMarkers(data = filter(bld_list_coord2,type=="311"), lng = ~lon, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "311 calls",
popup = ~coord, label = ~type) %>%
addMarkers(data = filter(bld_list_coord2,type=="crime"), lng = ~lon, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "crime",
popup = ~coord, label = ~type) %>%
addMarkers(data = filter(bld_list_coord2,type=="viol"), lng = ~lon, lat = ~lat,
clusterOptions = markerClusterOptions(), group = "blight viol",
popup = ~coord, label = ~type) %>%
addLayersControl(
overlayGroups = c("311 calls", "crime", "blight viol"),
options = layersControlOptions(collapsed = FALSE)
)
Data contains 12 rows with either missing or invalid lat/lon values and will be ignored
Let’s build a list of potential locations to explore by grouping all lat/long coordinates that are around +/-0.0002 degrees appart. This approach yields to 60679 buildings out of 146892 records.
# Allowable error
e_lat = 0.0002
e_lon = 0.0004
bld_list_all_filename <- "extracted_bld_list.csv"
# Calculation takes ~30 min, skip it if file above exists to speed up
if( file.exists(bld_list_all_filename) ) {
bld_list_all <- read_csv(bld_list_all_filename)
# file doesn't exists, let's crank the calculation up!
} else {
i <- 0
out <- data_frame()
input <- bld_list_coord2 %>%
mutate(coord_assigned = 0)
l <- nrow(input)
while(l > 0) {
input %<>%
filter(coord_assigned != 1)
lt <- input[1,]$lat
ln <- input[1,]$lon
input %<>%
mutate(coord_assigned = ifelse( (abs(lon-ln)<=e_lon & abs(lat-lt)<=e_lat), 1, 0) )
i <- i+1
out <- input %>%
filter(coord_assigned == 1) %>%
mutate(bld_id = i, bld_lat = median(lat), bld_lon = median(lon)) %>%
bind_rows(out)
l <- nrow(input)
if(l == 0) break
if(i%%100==0) print(paste0(i," iterations ",l," records left"))
}
bld_list_all <- out
out <- NULL
write_csv(bld_list_all, path = bld_list_all_filename)
}
Parsed with column specification:
cols(
type = col_character(),
lat = col_double(),
lon = col_double(),
address = col_character(),
coord = col_character(),
coord_assigned = col_double(),
bld_id = col_double(),
bld_lat = col_double(),
bld_lon = col_double()
)
Let’s join these babies
bld_list_crime <- bld_list_all %>%
filter(type == "crime") %>%
select(bld_id, bld_lat, bld_lon, address) %>%
left_join(bld_list_crime, by="address") %>%
select(-lat, -long) %>%
rename(lat=bld_lat, long=bld_lon)
bld_list_311 <- bld_list_all %>%
filter(type == "311") %>%
select(bld_id, bld_lat, bld_lon, address) %>%
left_join(bld_list_311, by="address") %>%
select(-lat, -long) %>%
rename(lat=bld_lat, long=bld_lon)
bld_list_viol <- bld_list_all %>%
filter(type == "viol") %>%
select(bld_id, bld_lat, bld_lon, address) %>%
left_join(bld_list_viol, by="address") %>%
select(-lat, -long) %>%
rename(lat=bld_lat, long=bld_lon)
Recalculate summaries at the building level for crime
tmp <- bld_list_crime %>%
select(-address) %>%
select(-lat, -long) %>%
group_by(bld_id) %>%
summarise_each(funs(sum))
bld_list_crime %<>%
select(bld_id, lat, long) %>%
distinct() %>%
left_join(tmp, by="bld_id")
head(bld_list_crime)
Recalculate summaries at the building level for 311
tmp <- bld_list_311 %>%
select(-address) %>%
select(-lat, -long, -min_rating, -max_rating, -diff_rating) %>%
group_by(bld_id) %>%
summarise_each(funs(sum))
tmp <- bld_list_311 %>%
select(bld_id, min_rating, max_rating) %>%
group_by(bld_id) %>%
summarise(min_rating = min(min_rating), max_rating = max(max_rating)) %>%
mutate(diff_rating = max_rating - min_rating) %>%
left_join(tmp, by="bld_id")
bld_list_311 %<>%
select(bld_id, lat, long) %>%
distinct() %>%
left_join(tmp, by="bld_id")
head(bld_list_311)
Recalculate summaries at the building level for blight violations
tmp <- bld_list_viol %>%
select(-address) %>%
select(-lat, -long, -max_amt) %>%
group_by(bld_id) %>%
summarise_each(funs(sum))
tmp <- bld_list_viol %>%
select(bld_id, max_amt) %>%
group_by(bld_id) %>%
summarise(max_amt = max(max_amt)) %>%
left_join(tmp, by="bld_id")
bld_list_viol %<>%
select(bld_id, lat, long) %>%
distinct() %>%
left_join(tmp, by="bld_id")
head(bld_list_viol)
Direct join by building identifier, bld_id
bld_list_crime_311_viol <- bld_list_crime %>%
full_join(bld_list_311, by = "bld_id") %>%
full_join(bld_list_viol, by = "bld_id")
bld_list_crime_311_viol %<>%
mutate(lat = ifelse(is.na(lat),ifelse(is.na(lat.x),lat.y,lat.x),lat)) %>%
mutate(long = ifelse(is.na(long),ifelse(is.na(long.x),long.y,long.x),long)) %>%
select(-lat.x, -long.x, -lat.y, -long.y) %>%
select(bld_id, lat, long, everything())
head(bld_list_crime_311_viol)
Adding surrounding statistics, 0.001 ~ 111m
bld_list_crime_311_viol %<>%
mutate(coord_neigh = paste(round(lat,digits=3),round(lat,digits=3))) %>%
group_by(coord_neigh) %>%
mutate(
num_crime_neigh = sum(num_crime),
num_311_neigh = sum(num_311),
max_rating_neigh = max(max_rating),
min_rating_neigh = min(min_rating),
num_viols_neigh = sum(num_viols),
num_respons_neigh = sum(num_responsible),
avg_max_amt_neigh = mean(max_amt)
) %>%
ungroup()
head(bld_list_crime_311_viol)
Taking care of NA’s, doing some dummy imputation
codesviol <- make.names(unlist(lapply(unique(violCodes_manual_categorization$ViolGroup), as.character)))
codes311 <- make.names(unlist(lapply(types311$issue_type, as.character)))
codescrime <- make.names(unlist(lapply(crime_categories$CATEGORY, as.character)))
cols <- c("num_crime", "num_311", "num_viols", "num_responsible",
"num_crime_neigh", "num_311_neigh", "num_viols_neigh", "num_respons_neigh")
cols <- c(cols, codesviol, codes311, codescrime)
bld_list_crime_311_viol %<>%
mutate_each_(funs(ifelse(is.na(.),0,.)),cols)
cols <- c("max_rating", "min_rating", "diff_rating", "max_amt",
"max_rating_neigh", "min_rating_neigh", "avg_max_amt_neigh")
bld_list_crime_311_viol %<>% mutate_each_(funs(ifelse(is.na(.),-1,.)),cols)
head(bld_list_crime_311_viol)
Faster function to assign blight by computing distance directly in degrees, 0.0001 ~ 11m ~ 37ft
in_blight <- function(lt, ln, data) {
data %>%
filter(sqrt((lat-lt)^2+(long-ln)^2)<rdegr+0.0001) %>%
nrow()
}
indata_degrees <- indata %>%
mutate(rdegr = 0.0001/37 * r)
The mashup results in about 3369 blighted entries, out of 60658 total buildings.
modeling_data <- bld_list_crime_311_viol
modeling_data %<>%
rowwise() %>%
mutate(nblighted = in_blight(lat, long, indata_degrees))
modeling_data %>%
filter(nblighted > 0) %>%
nrow()
[1] 3396
Creating dependent variable “condition” as boolean from “nblighted” (this describes the number of records blighted in the same lat/long)
modeling_data %<>%
mutate(condition = factor(if_else(nblighted > 0, "BLIGHTED", "NOT_BLIGHTED"))) %>%
select(-bld_id, -lat, -long, -nblighted, -coord_neigh)
table(modeling_data$condition)
NOT_BLIGHTED BLIGHTED
57262 3396
Balancing the two classes by downsampling the number of cases of non-blight (not ideal, but helps modeling and compute time)
mod_data_sampled <- modeling_data %>%
filter(condition == "BLIGHTED")
mod_data_sampled <- modeling_data %>%
filter(condition == "NOT_BLIGHTED") %>%
sample_n(5500) %>%
bind_rows(mod_data_sampled)
table(mod_data_sampled$condition)
NOT_BLIGHTED BLIGHTED
5500 3396
Creating a set-aside dataset for test, and a modeling set
set.seed(107)
inTest <- createDataPartition(y=mod_data_sampled$condition, p=0.1, list=FALSE)
testSet <- mod_data_sampled[inTest,]
modelSet <- mod_data_sampled[-inTest,]
table(testSet$condition)
NOT_BLIGHTED BLIGHTED
550 340
Creating a training and validation data sets for model training and validation, the training set is about 20k records
set.seed(107)
inValid <- createDataPartition(y=modelSet$condition, p=0.15, list=FALSE)
trainSet <- mod_data_sampled[-inValid,]
validSet <- mod_data_sampled[inValid,]
table(trainSet$condition)
NOT_BLIGHTED BLIGHTED
4666 3028
As for the validation set, it is about 3k records
table(validSet$condition)
NOT_BLIGHTED BLIGHTED
834 368
Plotting the histogram of number of violations for both blighted and not blighted buildings, we see some difference that might contribute to discriminate the 2 classes
p <- figure(width = 600, height = 350) %>%
ly_hist(num_viols, data = trainSet[which(trainSet$condition == "NOT_BLIGHTED"),], color = "blue", alpha = 0.25, freq = F, breaks = 25) %>%
ly_hist(num_viols, data = trainSet[which(trainSet$condition == "BLIGHTED"),], color = "red", alpha = 0.25, freq = F, breaks = 25)
p
Let’s train a general linear model (GLM) with 5-fold cross-validation trying to predict the bligth condition only by the number of violations reported in the coordinate, which yields 67.6% AUC on the ROC, 76% true positive, and 43% true negative.
# 5 fold cross-validation, 1 repetition
ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
classProbs = TRUE,
summaryFunction = twoClassSummary
)
# general linear model
glmModel <- train(condition ~ num_viols,
data = trainSet,
method = "glm",
trControl = ctrl,
metric = "ROC"
)
glmModel
Generalized Linear Model
7694 samples
1 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6156, 6155, 6155, 6156, 6154
Resampling results:
ROC Sens Spec
0.6988288 0.8821288 0.2757598
Let’s check the results in the validation set
glmVal <- predict(glmModel, newdata = validSet)
confusionMatrix(glmVal, validSet$condition, positive = "BLIGHTED")
Confusion Matrix and Statistics
Reference
Prediction NOT_BLIGHTED BLIGHTED
NOT_BLIGHTED 719 299
BLIGHTED 115 69
Accuracy : 0.6556
95% CI : (0.6279, 0.6824)
No Information Rate : 0.6938
P-Value [Acc > NIR] : 0.998
Kappa : 0.0577
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.1875
Specificity : 0.8621
Pos Pred Value : 0.3750
Neg Pred Value : 0.7063
Prevalence : 0.3062
Detection Rate : 0.0574
Detection Prevalence : 0.1531
Balanced Accuracy : 0.5248
'Positive' Class : BLIGHTED
A decision tree has a similar performance, 67% AUC on ROC for cross-validation
# decision tree model
dtreeModel <- train(condition ~ num_viols,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC"
)
Loading required package: rpart
dtreeModel
CART
7694 samples
1 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6155, 6155, 6156, 6155, 6155
Resampling results across tuning parameters:
cp ROC Sens Spec
0.0000000000 0.6884173 0.7904081 0.3986002
0.0000660502 0.6884173 0.7904081 0.3986002
0.0516842801 0.6076208 0.9065720 0.1942716
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 6.60502e-05.
Let’s check the results in the validation set
dtreeVal <- predict(dtreeModel, newdata = validSet)
confusionMatrix(dtreeVal, validSet$condition, positive = "BLIGHTED")
Confusion Matrix and Statistics
Reference
Prediction NOT_BLIGHTED BLIGHTED
NOT_BLIGHTED 696 279
BLIGHTED 138 89
Accuracy : 0.6531
95% CI : (0.6254, 0.68)
No Information Rate : 0.6938
P-Value [Acc > NIR] : 0.9989
Kappa : 0.0855
Mcnemar's Test P-Value : 7.09e-12
Sensitivity : 0.24185
Specificity : 0.83453
Pos Pred Value : 0.39207
Neg Pred Value : 0.71385
Prevalence : 0.30616
Detection Rate : 0.07404
Detection Prevalence : 0.18885
Balanced Accuracy : 0.53819
'Positive' Class : BLIGHTED
A random forest doesn’t yield significantly different results, with an 66.3% AUC on ROC
# random forest model
rfModel <- train(condition ~ num_viols,
data = trainSet,
method = "rf",
trControl = ctrl,
metric = "ROC"
)
Loading required package: randomForest
randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.
Attaching package: ‘randomForest’
The following object is masked from ‘package:ggplot2’:
margin
The following object is masked from ‘package:dplyr’:
combine
invalid mtry: reset to within valid rangeinvalid mtry: reset to within valid rangeinvalid mtry: reset to within valid rangeinvalid mtry: reset to within valid rangeinvalid mtry: reset to within valid rangeinvalid mtry: reset to within valid range
rfModel
Random Forest
7694 samples
1 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6155, 6155, 6156, 6155, 6155
Resampling results:
ROC Sens Spec
0.6636747 0.798765 0.3847448
Tuning parameter 'mtry' was held constant at a value of 2
Let’s check the results in the training set
rfVal <- predict(rfModel, newdata = trainSet)
confusionMatrix(rfVal, trainSet$condition, positive = "BLIGHTED")
Confusion Matrix and Statistics
Reference
Prediction NOT_BLIGHTED BLIGHTED
NOT_BLIGHTED 4009 2051
BLIGHTED 657 977
Accuracy : 0.648
95% CI : (0.6372, 0.6587)
No Information Rate : 0.6064
P-Value [Acc > NIR] : 3.04e-14
Kappa : 0.1978
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.3227
Specificity : 0.8592
Pos Pred Value : 0.5979
Neg Pred Value : 0.6616
Prevalence : 0.3936
Detection Rate : 0.1270
Detection Prevalence : 0.2124
Balanced Accuracy : 0.5909
'Positive' Class : BLIGHTED
Let’s check the results in the validation set
rfVal <- predict(rfModel, newdata = validSet)
confusionMatrix(rfVal, validSet$condition, positive = "BLIGHTED")
Confusion Matrix and Statistics
Reference
Prediction NOT_BLIGHTED BLIGHTED
NOT_BLIGHTED 700 280
BLIGHTED 134 88
Accuracy : 0.6556
95% CI : (0.6279, 0.6824)
No Information Rate : 0.6938
P-Value [Acc > NIR] : 0.998
Kappa : 0.0882
Mcnemar's Test P-Value : 1.031e-12
Sensitivity : 0.23913
Specificity : 0.83933
Pos Pred Value : 0.39640
Neg Pred Value : 0.71429
Prevalence : 0.30616
Detection Rate : 0.07321
Detection Prevalence : 0.18469
Balanced Accuracy : 0.53923
'Positive' Class : BLIGHTED
Comparing the models across cross-validation resamples, we see a similar performance in terms of ROC.
resamps <- resamples(list(glm = glmModel, dtree = dtreeModel, rf = rfModel))
summary(resamps)
Call:
summary.resamples(object = resamps)
Models: glm, dtree, rf
Number of resamples: 5
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glm 0.6853 0.6927 0.6970 0.6988 0.6995 0.7196 0
dtree 0.6715 0.6876 0.6931 0.6884 0.6943 0.6955 0
rf 0.6517 0.6528 0.6613 0.6637 0.6647 0.6880 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glm 0.8694 0.8767 0.8800 0.8821 0.8875 0.8971 0
dtree 0.7460 0.7463 0.7889 0.7904 0.8242 0.8467 0
rf 0.7612 0.7792 0.7803 0.7988 0.8028 0.8703 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
glm 0.2525 0.2529 0.2822 0.2758 0.2921 0.2992 0
dtree 0.3289 0.3746 0.4076 0.3986 0.4298 0.4521 0
rf 0.3289 0.3548 0.3828 0.3847 0.4092 0.4479 0
Other features
varcrime <- "num_crime"
var311 <- paste("min_rating", "max_rating", "diff_rating","num_311",sep = "+")
varviols <- paste("max_amt","num_viols","num_responsible",sep = "+")
varneigh <- paste("num_crime_neigh", "num_311_neigh","max_rating_neigh", "min_rating_neigh",
"num_viols_neigh","num_respons_neigh","avg_max_amt_neigh",sep="+")
var311codes <- paste(codes311, collapse = "+")
varcrimecodes <- paste(codescrime, collapse = "+")
varviolcodes <- paste(codesviol, collapse = "+")
f_basic <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, sep="+")))
f_basic_neigh <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, varneigh, sep="+")))
f_all <- as.formula(paste("condition ~ ", paste(varcrime, var311, varviols, varneigh, var311codes, varcrimecodes, varviolcodes, sep="+")))
The fact that the random forest performance in training and validation sets was similar, eludes that the model in not complex enough.
Adding 311 call data and
# decision tree model
dtreeModel2 <- train(f_basic,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel2
CART
7694 samples
8 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6156, 6155, 6155, 6155, 6155
Resampling results across tuning parameters:
cp ROC Sens Spec
0.001981506 0.6964660 0.7196669 0.54129286
0.002311757 0.6957292 0.7194533 0.53864441
0.063408190 0.5366942 0.9498392 0.09834983
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.001981506.
Let’s add the neighborhood features
# decision tree model
dtreeModel3 <- train(f_basic_neigh,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel3
CART
7694 samples
15 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6155, 6155, 6155, 6156, 6155
Resampling results across tuning parameters:
cp ROC Sens Spec
0.001981506 0.7017059 0.7297560 0.5402891
0.002311757 0.7026690 0.7329689 0.5366506
0.063408190 0.6116912 0.8782783 0.2570177
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.002311757.
Let’s add the all features
# decision tree model
dtreeModel4 <- train(f_all,
data = trainSet,
method = "rpart",
trControl = ctrl,
metric = "ROC",
na.action=na.exclude
)
dtreeModel4
CART
7694 samples
100 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6155, 6155, 6156, 6154, 6156
Resampling results across tuning parameters:
cp ROC Sens Spec
0.003632761 0.6991438 0.7378916 0.5234209
0.005779392 0.6906405 0.7633973 0.4801658
0.063408190 0.6117785 0.8763481 0.2545864
ROC was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.003632761.
resamps <- resamples(list(dtree_violonly = dtreeModel, dtree_plus311crime = dtreeModel2, dtree_plusneigh = dtreeModel3, dtree_pluscodes = dtreeModel4))
summary(resamps)
Call:
summary.resamples(object = resamps)
Models: dtree_violonly, dtree_plus311crime, dtree_plusneigh, dtree_pluscodes
Number of resamples: 5
ROC
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
dtree_violonly 0.6715 0.6876 0.6931 0.6884 0.6943 0.6955 0
dtree_plus311crime 0.6828 0.6865 0.6891 0.6965 0.6975 0.7265 0
dtree_plusneigh 0.6920 0.6984 0.7037 0.7027 0.7078 0.7115 0
dtree_pluscodes 0.6879 0.6939 0.6977 0.6991 0.7068 0.7094 0
Sens
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
dtree_violonly 0.7460 0.7463 0.7889 0.7904 0.8242 0.8467 0
dtree_plus311crime 0.6742 0.6774 0.6988 0.7197 0.7537 0.7942 0
dtree_plusneigh 0.6752 0.6935 0.7002 0.7330 0.7610 0.8349 0
dtree_pluscodes 0.6249 0.7355 0.7567 0.7379 0.7685 0.8039 0
Spec
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
dtree_violonly 0.3289 0.3746 0.4076 0.3986 0.4298 0.4521 0
dtree_plus311crime 0.4323 0.5421 0.5429 0.5413 0.5752 0.6139 0
dtree_plusneigh 0.4059 0.4694 0.5818 0.5367 0.6089 0.6172 0
dtree_pluscodes 0.4505 0.4595 0.4967 0.5234 0.5058 0.7046 0
Let’s try a random forest with all the variables
# random forest model
rfModel2 <- train(condition ~ .,
data = trainSet,
method = "rf",
trControl = ctrl,
metric = "ROC"
)
rfModel2
Random Forest
7694 samples
100 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 6155, 6156, 6155, 6155, 6155
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.7175842 0.9228461 0.2004577
51 0.6851794 0.7473210 0.4999989
100 0.6810180 0.7393912 0.4983455
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
rf2Val <- predict(rfModel2, newdata = validSet)
confusionMatrix(rf2Val, validSet$condition, positive = "BLIGHTED")
Confusion Matrix and Statistics
Reference
Prediction NOT_BLIGHTED BLIGHTED
NOT_BLIGHTED 779 335
BLIGHTED 55 33
Accuracy : 0.6755
95% CI : (0.6483, 0.702)
No Information Rate : 0.6938
P-Value [Acc > NIR] : 0.9199
Kappa : 0.0301
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.08967
Specificity : 0.93405
Pos Pred Value : 0.37500
Neg Pred Value : 0.69928
Prevalence : 0.30616
Detection Rate : 0.02745
Detection Prevalence : 0.07321
Balanced Accuracy : 0.51186
'Positive' Class : BLIGHTED
rf2varImp <- varImp(rfModel2, scale = FALSE)
plot(rf2varImp, top=28)
Retraining Random Forest with only most important features
trainSet2 <- trainSet %>% select(max_amt, num_responsible, num_viols, compliance, hazard, num_crime, waste, other, condition)
# random forest model
rfModel3 <- train(condition ~ .,
data = trainSet2,
method = "rf",
trControl = ctrl,
metric = "ROC"
)
rfModel3
Random Forest
7694 samples
8 predictor
2 classes: 'NOT_BLIGHTED', 'BLIGHTED'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6156, 6154, 6155, 6156, 6155
Resampling results across tuning parameters:
mtry ROC Sens Spec
2 0.7002403 0.7571796 0.4752443
5 0.6886701 0.7423909 0.4920825
8 0.6855091 0.7387472 0.4907591
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
dtrain <- Matrix::sparse.model.matrix(condition ~ ., data=trainSet)
deval <- Matrix::sparse.model.matrix(condition ~ ., data=validSet)
grid <- expand.grid(nrounds = c(300, 400, 500, 600),
max_depth = c(3, 5, 7, 9),
eta = c(0.05, 0.1, 0.2, 0.3),
gamma = 1,
colsample_bytree = c(0.5, 1),
min_child_weight = 1,
subsample = 1)
ctrl <- trainControl(method = "cv",
number = 5,
summaryFunction = twoClassSummary,
classProbs = TRUE,
allowParallel = TRUE )
set.seed(107)
system.time(xgbTune <- train(x = dtrain,
y = factor(trainSet$condition),
method = "xgbTree",
metric = "ROC",
tuneGrid = grid,
verbose = TRUE,
trControl = ctrl))
The training data could not be converted to a data frame for saving
user system elapsed
1894.059 15.966 530.916
Gradient boosting reaches the best performance in crossvalidation, of 72.6% ROC/AUC, with a 73.9% true positive, and 55.5% true negative rates. The optimal hyperparmeters being: nrounds = 300, max_depth = 3, eta = 0.3, gamma = 1, colsample_bytree = 1, min_child_weight = 1 and subsample = 1.
xgbTune
eXtreme Gradient Boosting
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 6155, 6155, 6155, 6155, 6156
Resampling results across tuning parameters:
eta max_depth colsample_bytree nrounds ROC Sens Spec
0.05 3 0.5 300 0.6177486 0.9972137 0.001981289
0.05 3 0.5 400 0.6177486 0.9972137 0.001981289
0.05 3 0.5 500 0.6177486 0.9972137 0.001981289
0.05 3 0.5 600 0.6177486 0.9972137 0.001981289
0.05 3 1.0 300 0.7249051 0.7385370 0.561761994
0.05 3 1.0 400 0.7249049 0.7385370 0.561761994
0.05 3 1.0 500 0.7249049 0.7385370 0.561761994
0.05 3 1.0 600 0.7249049 0.7385370 0.561761994
0.05 5 0.5 300 0.6152627 0.9920688 0.005943867
0.05 5 0.5 400 0.6152627 0.9920688 0.005943867
0.05 5 0.5 500 0.6152627 0.9920688 0.005943867
0.05 5 0.5 600 0.6152627 0.9920688 0.005943867
0.05 5 1.0 300 0.7211144 0.7417485 0.549217467
0.05 5 1.0 400 0.7208059 0.7408911 0.549877533
0.05 5 1.0 500 0.7208058 0.7408911 0.549877533
0.05 5 1.0 600 0.7208058 0.7408911 0.549877533
0.05 7 0.5 300 0.6120924 0.9873533 0.010897090
0.05 7 0.5 400 0.6120924 0.9873533 0.010897090
0.05 7 0.5 500 0.6120924 0.9873533 0.010897090
0.05 7 0.5 600 0.6120924 0.9873533 0.010897090
0.05 7 1.0 300 0.7129869 0.7383194 0.539633963
0.05 7 1.0 400 0.7128803 0.7389625 0.538643864
0.05 7 1.0 500 0.7128803 0.7389625 0.538643864
0.05 7 1.0 600 0.7128803 0.7389625 0.538643864
0.05 9 0.5 300 0.6105243 0.9847810 0.012217767
0.05 9 0.5 400 0.6105243 0.9847810 0.012217767
0.05 9 0.5 500 0.6105243 0.9847810 0.012217767
0.05 9 0.5 600 0.6105243 0.9847810 0.012217767
0.05 9 1.0 300 0.7109585 0.7411011 0.536984971
0.05 9 1.0 400 0.7109585 0.7411011 0.536984971
0.05 9 1.0 500 0.7109585 0.7411011 0.536984971
0.05 9 1.0 600 0.7109585 0.7411011 0.536984971
0.10 3 0.5 300 0.6172063 0.9972135 0.002311322
0.10 3 0.5 400 0.6172063 0.9972135 0.002311322
0.10 3 0.5 500 0.6172063 0.9972135 0.002311322
0.10 3 0.5 600 0.6172063 0.9972135 0.002311322
0.10 3 1.0 300 0.7240996 0.7387488 0.560108011
0.10 3 1.0 400 0.7240996 0.7387488 0.560108011
0.10 3 1.0 500 0.7240996 0.7387488 0.560108011
0.10 3 1.0 600 0.7240996 0.7387488 0.560108011
0.10 5 0.5 300 0.6140730 0.9920693 0.007924611
0.10 5 0.5 400 0.6140730 0.9920693 0.007924611
0.10 5 0.5 500 0.6140730 0.9920693 0.007924611
0.10 5 0.5 600 0.6140730 0.9920693 0.007924611
0.10 5 1.0 300 0.7188853 0.7398213 0.548896708
0.10 5 1.0 400 0.7188853 0.7398213 0.548896708
0.10 5 1.0 500 0.7188853 0.7398213 0.548896708
0.10 5 1.0 600 0.7188853 0.7398213 0.548896708
0.10 7 0.5 300 0.6111813 0.9869248 0.009575867
0.10 7 0.5 400 0.6111813 0.9869248 0.009575867
0.10 7 0.5 500 0.6111813 0.9869248 0.009575867
0.10 7 0.5 600 0.6111813 0.9869248 0.009575867
0.10 7 1.0 300 0.7121805 0.7400366 0.531709898
0.10 7 1.0 400 0.7121805 0.7400366 0.531709898
0.10 7 1.0 500 0.7121805 0.7400366 0.531709898
0.10 7 1.0 600 0.7121805 0.7400366 0.531709898
0.10 9 0.5 300 0.6093173 0.9815660 0.013538990
0.10 9 0.5 400 0.6093173 0.9815660 0.013538990
0.10 9 0.5 500 0.6093173 0.9815660 0.013538990
0.10 9 0.5 600 0.6093173 0.9815660 0.013538990
0.10 9 1.0 300 0.7068460 0.7415328 0.524102774
0.10 9 1.0 400 0.7068460 0.7415328 0.524102774
0.10 9 1.0 500 0.7068460 0.7415328 0.524102774
0.10 9 1.0 600 0.7068462 0.7415328 0.524102774
0.20 3 0.5 300 0.6175922 0.9957134 0.003962578
0.20 3 0.5 400 0.6175922 0.9957134 0.003962578
0.20 3 0.5 500 0.6175922 0.9957134 0.003962578
0.20 3 0.5 600 0.6175922 0.9957134 0.003962578
0.20 3 1.0 300 0.7231558 0.7383224 0.558456755
0.20 3 1.0 400 0.7231558 0.7383224 0.558456755
0.20 3 1.0 500 0.7231558 0.7383224 0.558456755
0.20 3 1.0 600 0.7231558 0.7383224 0.558456755
0.20 5 0.5 300 0.6116664 0.9884251 0.008914710
0.20 5 0.5 400 0.6116664 0.9884251 0.008914710
0.20 5 0.5 500 0.6116664 0.9884251 0.008914710
0.20 5 0.5 600 0.6116664 0.9884251 0.008914710
0.20 5 1.0 300 0.7173222 0.7426016 0.543929302
0.20 5 1.0 400 0.7173222 0.7426016 0.543929302
0.20 5 1.0 500 0.7173222 0.7426016 0.543929302
0.20 5 1.0 600 0.7173222 0.7426016 0.543929302
0.20 7 0.5 300 0.6119980 0.9858530 0.012217767
0.20 7 0.5 400 0.6119980 0.9858530 0.012217767
0.20 7 0.5 500 0.6119980 0.9858530 0.012217767
0.20 7 0.5 600 0.6119980 0.9858530 0.012217767
0.20 7 1.0 300 0.7119049 0.7456041 0.536663666
0.20 7 1.0 400 0.7119049 0.7456041 0.536663666
0.20 7 1.0 500 0.7119049 0.7456041 0.536663666
0.20 7 1.0 600 0.7119049 0.7456041 0.536663666
0.20 9 0.5 300 0.6110154 0.9830670 0.014859668
0.20 9 0.5 400 0.6110154 0.9830670 0.014859668
0.20 9 0.5 500 0.6110154 0.9830670 0.014859668
0.20 9 0.5 600 0.6110154 0.9830670 0.014859668
0.20 9 1.0 300 0.7032900 0.7391776 0.523125222
0.20 9 1.0 400 0.7032900 0.7391776 0.523125222
0.20 9 1.0 500 0.7032900 0.7391776 0.523125222
0.20 9 1.0 600 0.7032900 0.7391776 0.523125222
0.30 3 0.5 300 0.6177274 0.9948560 0.004954859
0.30 3 0.5 400 0.6177274 0.9948560 0.004954859
0.30 3 0.5 500 0.6177274 0.9948560 0.004954859
0.30 3 0.5 600 0.6177274 0.9948560 0.004954859
0.30 3 1.0 300 0.7255297 0.7387500 0.555483730
0.30 3 1.0 400 0.7255297 0.7387500 0.555483730
0.30 3 1.0 500 0.7255297 0.7387500 0.555483730
0.30 3 1.0 600 0.7255297 0.7387500 0.555483730
0.30 5 0.5 300 0.6123241 0.9886397 0.008585222
0.30 5 0.5 400 0.6123241 0.9886397 0.008585222
0.30 5 0.5 500 0.6123241 0.9886397 0.008585222
0.30 5 0.5 600 0.6123241 0.9886397 0.008585222
0.30 5 1.0 300 0.7116680 0.7383206 0.546244443
0.30 5 1.0 400 0.7116680 0.7383206 0.546244443
0.30 5 1.0 500 0.7116680 0.7383206 0.546244443
0.30 5 1.0 600 0.7116680 0.7383206 0.546244443
0.30 7 0.5 300 0.6095226 0.9837101 0.011557156
0.30 7 0.5 400 0.6095226 0.9837101 0.011557156
0.30 7 0.5 500 0.6095226 0.9837101 0.011557156
0.30 7 0.5 600 0.6095226 0.9837101 0.011557156
0.30 7 1.0 300 0.7059779 0.7314615 0.535349535
0.30 7 1.0 400 0.7059779 0.7314615 0.535349535
0.30 7 1.0 500 0.7059779 0.7314615 0.535349535
0.30 7 1.0 600 0.7059779 0.7314615 0.535349535
0.30 9 0.5 300 0.6050948 0.9766370 0.015189701
0.30 9 0.5 400 0.6050948 0.9766370 0.015189701
0.30 9 0.5 500 0.6050948 0.9766370 0.015189701
0.30 9 0.5 600 0.6050948 0.9766370 0.015189701
0.30 9 1.0 300 0.7016877 0.7421743 0.518161634
0.30 9 1.0 400 0.7016877 0.7421743 0.518161634
0.30 9 1.0 500 0.7016877 0.7421743 0.518161634
0.30 9 1.0 600 0.7016877 0.7421743 0.518161634
Tuning parameter 'gamma' was held constant at a value of 1
Tuning
parameter 'min_child_weight' was held constant at a value of 1
Tuning
parameter 'subsample' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were nrounds = 300, max_depth = 3, eta = 0.3, gamma
= 1, colsample_bytree = 1, min_child_weight = 1 and subsample = 1.
ggplot(xgbTune) +
theme(legend.position = "top")
Ignoring unknown aesthetics: shape
trainlab <- trainSet %>%
select(condition) %>%
mutate(lab = ifelse((condition=="BLIGHTED"),1,0)) %>%
select(lab)
param <- list(objective = 'binary:logistic',
eval_metric = 'auc',
max_depth = 3,
eta = 0.3,
gamma = 1,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1)
set.seed(107)
system.time(xgb <- xgboost(params = param,
data = dtrain,
label = trainlab$lab,
nrounds = 300,
print_every_n = 100,
verbose = 1))
[1] train-auc:0.713091
[101] train-auc:0.768482
[201] train-auc:0.768482
[300] train-auc:0.768482
user system elapsed
1.354 0.013 1.379
model <- xgb.dump(xgb, with_stats = TRUE)
names <- dimnames(dtrain)[[2]]
importance_matrix <- xgb.importance(names, model = xgb)[0:20]
xgb.plot.importance(importance_matrix)
Let’s check the results in the validation set
validlab <- validSet %>%
select(condition) %>%
mutate(lab = ifelse((condition=="BLIGHTED"),1,0)) %>%
select(lab)
xgbVal <- predict(xgb, newdata = deval)
xgbVal.resp <- ifelse(xgbVal > 0.3, 1, 0)
confusionMatrix(xgbVal.resp, validlab$lab, positive = '1')
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 384 81
1 450 287
Accuracy : 0.5582
95% CI : (0.5296, 0.5866)
No Information Rate : 0.6938
P-Value [Acc > NIR] : 1
Kappa : 0.1877
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.7799
Specificity : 0.4604
Pos Pred Value : 0.3894
Neg Pred Value : 0.8258
Prevalence : 0.3062
Detection Rate : 0.2388
Detection Prevalence : 0.6131
Balanced Accuracy : 0.6202
'Positive' Class : 1
Plotting the ROC, we can explore different values of true positive and false positive rates.
xgb.pred <- ROCR::prediction(xgbVal, validlab$lab)
xgb.perf <- ROCR::performance(xgb.pred, "tpr", "fpr")
plot(xgb.perf,
avg="threshold",
colorize=TRUE,
lwd=3,
print.cutoffs.at=seq(0, 1, by=0.05),
text.adj=c(-0.5, 0.5),
text.cex=0.6)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")
Other data to explore in the future: - Neighboorhood/region - Demographic data - MLS / Zillow (last time sold, price, # sold houses around) - Detroit Parcel data